
Added support for overriding tensor buffer types #2007


Open · wants to merge 2 commits into main

Conversation

@zpin commented Apr 29, 2025

Equivalent to the -ot llama.cpp argument:

{"--override-tensor", "-ot"}, "<tensor name pattern>=<buffer type>,...",

It can be passed as an optional string to the Llama class via the new override_tensor parameter, using the same format as the argument above.

This provides finer control over memory usage, letting you selectively place specific tensors on different devices, which is especially helpful when running large MoE models.
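As with llama.cpp's -ot flag, each override is a comma-separated entry of the form <tensor name pattern>=<buffer type>, where the pattern is matched against tensor names. The sketch below shows how such a string decomposes into (pattern, buffer type) pairs; parse_tensor_overrides is a hypothetical helper for illustration, not part of this PR, and the real mapping to ggml buffer types happens inside llama.cpp itself.

```python
import re

def parse_tensor_overrides(spec: str):
    """Split "<tensor name pattern>=<buffer type>,..." into
    (compiled regex, buffer type name) pairs.

    Illustrative only: llama.cpp performs the actual matching and
    resolves buffer type names (e.g. "CPU", "CUDA0") internally.
    """
    overrides = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        pattern, _, buft = entry.partition("=")
        if not buft:
            raise ValueError(f"missing buffer type in {entry!r}")
        overrides.append((re.compile(pattern), buft))
    return overrides

# Example: keep expert (MoE) tensors on the CPU while the rest of the
# model is offloaded to the GPU as usual.
rules = parse_tensor_overrides("exp=CPU")
assert rules[0][0].search("blk.0.ffn_gate_exps.weight")
assert rules[0][1] == "CPU"
```

With this PR, the same string would be passed directly, e.g. Llama(model_path=..., override_tensor="exp=CPU").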

@ACupofAir

Could you show the usage of this parameter?
I have tried it with:

python3 -m llama_cpp.server --model /home/LLM/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf --port 8002 --verbose True --n_gpu_layers 99 ---tensor_buft_overrides exp=CPU
# and 
python3 -m llama_cpp.server --model /home/arda/LLM/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf --port 8002 --verbose True --n_gpu_layers 99 ---override_tensor exp=CPU 

Both of them lead to an error; here is the log:

usage: __main__.py [-h] [--model MODEL] [--model_alias MODEL_ALIAS] [--n_gpu_layers N_GPU_LAYERS]
                   [--split_mode SPLIT_MODE] [--main_gpu MAIN_GPU] [--tensor_split [TENSOR_SPLIT ...]]
                   [--vocab_only VOCAB_ONLY] [--use_mmap USE_MMAP] [--use_mlock USE_MLOCK]
                   [--kv_overrides [KV_OVERRIDES ...]] [--rpc_servers RPC_SERVERS] [--seed SEED] [--n_ctx N_CTX]
                   [--n_batch N_BATCH] [--n_ubatch N_UBATCH] [--n_threads N_THREADS]
                   [--n_threads_batch N_THREADS_BATCH] [--rope_scaling_type ROPE_SCALING_TYPE]
                   [--rope_freq_base ROPE_FREQ_BASE] [--rope_freq_scale ROPE_FREQ_SCALE]
                   [--yarn_ext_factor YARN_EXT_FACTOR] [--yarn_attn_factor YARN_ATTN_FACTOR]
                   [--yarn_beta_fast YARN_BETA_FAST] [--yarn_beta_slow YARN_BETA_SLOW]
                   [--yarn_orig_ctx YARN_ORIG_CTX] [--mul_mat_q MUL_MAT_Q] [--logits_all LOGITS_ALL]
                   [--embedding EMBEDDING] [--offload_kqv OFFLOAD_KQV] [--flash_attn FLASH_ATTN]
                   [--last_n_tokens_size LAST_N_TOKENS_SIZE] [--lora_base LORA_BASE] [--lora_path LORA_PATH]
                   [--numa NUMA] [--chat_format CHAT_FORMAT] [--clip_model_path CLIP_MODEL_PATH] [--cache CACHE]
                   [--cache_type CACHE_TYPE] [--cache_size CACHE_SIZE]
                   [--hf_tokenizer_config_path HF_TOKENIZER_CONFIG_PATH]
                   [--hf_pretrained_model_name_or_path HF_PRETRAINED_MODEL_NAME_OR_PATH]
                   [--hf_model_repo_id HF_MODEL_REPO_ID] [--draft_model DRAFT_MODEL]
                   [--draft_model_num_pred_tokens DRAFT_MODEL_NUM_PRED_TOKENS] [--type_k TYPE_K] [--type_v TYPE_V]
                   [--verbose VERBOSE] [--host HOST] [--port PORT] [--ssl_keyfile SSL_KEYFILE]
                   [--ssl_certfile SSL_CERTFILE] [--api_key API_KEY] [--interrupt_requests INTERRUPT_REQUESTS]
                   [--disable_ping_events DISABLE_PING_EVENTS] [--root_path ROOT_PATH] [--config_file CONFIG_FILE]
__main__.py: error: unrecognized arguments: ---tensor_buft_overrides exp=CPU
